Abstract

The task of similarity search in multimedia databases 
is usually accomplished by range or k nearest neighbor
queries. However, the expressive power of these
"single-example" queries fails when the user's delicate
query intent is not available as a single example.
Recently, the well-known skyline operator was reused in
metric similarity search as a "multi-example" query 
type. When applied on a multi-dimensional database 
(i.e., on a multi-attribute table), the traditional sky- 
line operator selects all database objects that are not 
dominated by other objects. The metric skyline query 
adopts the skyline operator such that the multiple at- 
tributes are represented by distances (similarities) to 
multiple query examples. Hence, we can view the 
metric skyline as a set of representative database ob- 
jects which are as similar to all the examples as pos- 
sible and, simultaneously, are semantically distinct. 

In this paper we propose a technique of processing 
the metric skyline query by use of PM-tree, while we 
show that our technique significantly outperforms the 
original M-tree based implementation in both time 
and space costs. In experiments we also evaluate the 
partial metric skyline processing, where only a con- 
trolled number of skyline objects is retrieved. 



1 Introduction 

As the volumes of complex unstructured data collec- 
tions grow almost exponentially in time, the attention 
to content-based similarity search steadily increases. 
The concept of numeric similarity between two data 
entities is one of the approaches used in querying un- 
structured data, where a similarity function serves as 
multi-valued relevance of data objects to a query (example)
object. The content-based similarity search
paradigm has been successfully employed in areas like 
multimedia databases, time series retrieval, bioinfor- 
matic and medical databases, data mining, and oth- 
ers. At the same time, the "similarity-centric" view 
on such data demands specific alternative techniques 
for modeling, indexing and retrieval, which dramat- 
ically differ from the traditional approaches to man- 
agement of structured data (e.g., B-trees in relational 
databases). 

In the rest of the section we introduce the fundamentals
of similarity search and briefly summarize
the paper contributions.

1.1 Similarity search 

Given a collection C of unstructured data entities
(e.g., multimedia objects, like images), to query the
collection we need to establish a model consisting of
the object universe $\mathbb{U}$, a transformation function (a
feature extraction method, resp.) $t : C \rightarrow \mathbb{U}$, and a
similarity function $\delta : \mathbb{U} \times \mathbb{U} \rightarrow \mathbb{R}$. The transformation
t turns the collection C of original data entities
into a database of descriptors $\mathbb{S} \subset \mathbb{U}$. In most cases
the similarity function $\delta$ is expected to be a metric
distance, because metric properties can be effectively
used to index the database $\mathbb{S}$ for efficient (fast) query
processing, as discussed later in Section 1.2.

1.1.1 Single-example queries 

The portfolio of available similarity query types con- 
sists of mostly single-example queries. The range 
query and k nearest neighbor (kNN) query represent 
the two most popular similarity query types. Using 
a range query $(Q, r_Q)$ we ask for all objects $O_i \in \mathbb{S}$
the distances of which to a single query object Q are
at most $r_Q$. On the other hand, a kNN query (Q, k)
selects the k database objects closest to Q.
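
As a baseline illustration, both query types can be answered by a sequential scan of the database. The sketch below is not part of the original text; the function names and the choice of the Euclidean distance as the example metric are assumptions made only for illustration.

```python
import heapq
import math

def euclidean(a, b):
    """Example metric (an assumption); any metric distance could be plugged in."""
    return math.dist(a, b)

def range_query(database, q, r_q, dist=euclidean):
    """Return all objects whose distance to q is at most r_q."""
    return [o for o in database if dist(q, o) <= r_q]

def knn_query(database, q, k, dist=euclidean):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, database, key=lambda o: dist(q, o))

db = [(0, 0), (1, 1), (3, 4), (6, 8)]
print(range_query(db, (0, 0), 5.0))
print(knn_query(db, (0, 0), 2))
```

Such a scan computes one distance per database object, which is exactly the cost that the metric access methods discussed later try to avoid.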

Besides range and kNN queries, there exist some 
less frequently used query types, like reverse (k)NN 
queries, returning those database objects having the 
query object Q within their (k) nearest neighbor(s).

1.1.2 Multi-example queries 

Although the single-example queries are frequently 
used nowadays, their expressive power may become 
unsatisfactory in the future due to increasing complexity
and quantity of available data. The acquisition
of an example query object is the user's "ad-hoc"
responsibility. However, when just a single query
example should represent the user's delicate intent on 
the subject of retrieval, finding an appropriate exam- 
ple could be a hard task. Such a scenario is likely to 
occur when a large data collection is available, and so 
the query specification has to be fine-grained. Hence, 
instead of querying by a single example, an easier 
way for the user could be a specification of several 
query examples which jointly describe the query in- 
tent. Such a multi-example approach allows the user 
to set the number of query examples and to weigh 
the contribution of individual examples. Moreover, 
obtaining multiple examples, where each example corresponds
to a partial query intent, is a much easier task
than finding a single "holy-grail" example.

As for existing solutions to multi-example query 
types, we distinguish three directions. First, there 
exist many model-specific techniques based on an aggregation
or unification of the multiple examples, e.g.,
querying by a centroid in case of vectors, or by a
union, intersection or other composition of features
of the query examples (Tang & Acton 2003). Although
successful in a narrow context (e.g., in image
retrieval), this approach is not applicable to the general
(metric) similarity case.

Second, a popular approach to multi-example querying
is issuing multiple single-example queries, while
the resulting multiple ranked lists are aggregated by
a top-k operator using an arbitrary aggregation function,
which provides an important add-on to the expressive power of querying. The
main drawback of top-k queries is their high com- 
putational and space cost - there have to be several 
single-example queries issued and their results mate- 
rialized and aggregated. 

Third, browsing is a retrieval modality where the 
user plays an important role in the querying process. 
The user incrementally issues single-example queries, 
while from the result of the previous query the
user selects a new example. Since there are multiple
examples used before the user achieves her/his goal,
we could interpret browsing as multi-example query
processing. The drawback of browsing is obvious: the
user is bothered by the enforced interactivity, while
the subsequent simple queries may not lead to a satisfactory
result anyway.

As other complex operations we name similarity
joins (Jacox & Samet 2008), joining pairs of objects
(from one or more databases) based on their proxim- 
ity, or a special case of the similarity join - the clos- 
est pair operator, selecting the two closest objects in 
the database(s). However, even though the similarity 
joins provide a complex retrieval functionality, again, 
they can be regarded as a series of single-example 
queries, rather than a regular multi-example query 
type. 

In this paper we deal with the metric skyline query
(detailed in Section 2), which represents a "native"
multi-example query type.

1.2 Metric Access Methods 

When the similarity function $\delta$ is a distance metric,
the metric access methods (MAMs) can be used
for efficient (fast) similarity query processing (Zezula
[...] Chavez et al. 2001).
A common principle of MAMs is the utilization of metric
postulates (positiveness, symmetry and triangle
inequality), which allow the data space to be partitioned
into equivalence classes of close (similar) data objects.
The classes are embedded within a data structure
which is stored in an index file, while the index is
later used to quickly answer range, kNN, or other similarity
queries. In particular, when a similarity
query is issued, the MAMs exclude many non-relevant equivalence
classes from the search (based on metric properties
of $\delta$), so only several candidate classes of objects
have to be exhaustively (sequentially) searched, see
Figure 1. In consequence, searching a small number
of candidate classes results in reduced costs of the
query.

The number of distance computations $\delta(\cdot, \cdot)$ is considered
the major component of the overall costs
when indexing or querying a database. Some other
cost components (like I/O costs, internal CPU costs)
could be taken into consideration when the computational
complexity of $\delta$ is low.

Figure 1: Metric access methods.

In the following we briefly describe the M-tree and 
the PM-tree, two MAMs used further in the paper for 
implementation of metric skyline queries. 



1.2.1 M-tree

The M-tree (Ciaccia et al. 1997) is a dynamic metric
access method that provides good performance in
database environments. The M-tree index is a hierarchical
structure, where some of the data objects
are selected as centers (references or local pivots) of
ball-shaped regions, and the remaining objects are
partitioned among the regions in order to build up a
balanced and compact hierarchy, see Figure 2.

Figure 2: (a) M-tree (b) Basic filtering (c) Parent
filtering.

Each region (subtree) is indexed recursively in a B-tree-like
(bottom-up) way of construction. The inner
nodes of M-tree store routing entries

$$\mathrm{rout}_M(R) = [R, r_R, \delta(R, \mathrm{Par}(R)), \mathrm{ptr}(T(R))]$$

where R is a data object representing the center of the
respective ball region, $r_R$ is a covering radius of the
ball, $\delta(R, \mathrm{Par}(R))$ is the so-called to-parent distance (the
distance from R to the object P of the parent routing
entry), and finally $\mathrm{ptr}(T(R))$ is a pointer to the
entry's subtree T(R). In order to correctly bound the
data in T(R)'s leaves, the routing entry must satisfy
the nesting condition: $\forall O_i \in T(R),\ r_R \geq \delta(R, O_i)$.
The data is stored in the leaves of M-tree. Each leaf
contains ground entries

$$\mathrm{grnd}_M(D) = [D, \mathrm{id}(D), \delta(D, \mathrm{Par}(D))]$$

where D is the data object itself (externally identified
by id(D)), and $\delta(D, \mathrm{Par}(D))$ is, again, the to-parent
distance. See an example of entries in Figure 2a.

The queries are implemented by traversing the
tree, starting from the root. Those nodes are accessed,
the parent regions of which are overlapped by
the query region, e.g., by a range query ball $(Q, r_Q)$.
The check for region-and-query overlap requires an
explicit distance computation $\delta(R, Q)$ (called basic
filtering). In particular, if $\delta(R, Q) \leq r_Q + r_R$, the
data ball $(R, r_R)$ overlaps the query ball $(Q, r_Q)$, thus the
child node has to be accessed, see Figure 2b. If not,
the respective subtree is filtered from further processing.
Moreover, each node in the tree contains the distances
from the routing/ground entries to the center
of its parent routing entry (the to-parent distances).
Hence, some of the M-tree branches can be filtered
without the need of a distance computation, thus
avoiding the "more expensive" basic overlap check. In
particular, if $|\delta(P, Q) - \delta(P, R)| > r_Q + r_R$, the data
ball R cannot overlap the query ball (called parent filtering),
thus the child node need not be re-checked
by basic filtering, see Figure 2c. Note that $\delta(P, Q)$ was
already computed at the unsuccessful parent's basic
filtering.
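
The two filtering steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the M-tree implementation itself: the function names and the returned pair (decision, number of distance computations) are hypothetical, and the distances $\delta(P, Q)$ and $\delta(P, R)$ are assumed to be passed in, since the former is known from the parent level and the latter is stored in the entry.

```python
def parent_filter_excludes(d_pq, d_pr, r_r, r_q):
    """Parent filtering: True if the ball (R, r_r) provably cannot overlap
    the query ball (Q, r_q), using only already known distances."""
    return abs(d_pq - d_pr) > r_q + r_r

def basic_filter_overlaps(d_rq, r_r, r_q):
    """Basic filtering: True if the ball (R, r_r) overlaps the query ball."""
    return d_rq <= r_q + r_r

def must_visit(dist, q, r_q, center, r_r, d_pq, d_pr):
    """Return (visit_subtree, distance_computations_spent)."""
    if parent_filter_excludes(d_pq, d_pr, r_r, r_q):
        return False, 0                      # filtered for free
    return basic_filter_overlaps(dist(center, q), r_r, r_q), 1

# Toy 1-D example with the absolute difference as the metric.
dist = lambda a, b: abs(a - b)
P, Q = 0, 10
print(must_visit(dist, Q, 3, 1, 2, dist(P, Q), dist(P, 1)))
print(must_visit(dist, Q, 3, 6, 2, dist(P, Q), dist(P, 6)))
```

In the first call the subtree is excluded by parent filtering alone, while the second call has to fall back to the explicit distance computation.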

1.2.2 PM-tree 



The idea of PM-tree (Skopal 2004, Skopal et al.
2005) is to enhance the hierarchy of M-tree by
information related to a static set of p global pivots
$P_i \in \mathcal{P} \subset \mathbb{U}$. In a PM-tree's routing entry, the original
M-tree-inherited ball region is further cut off by a
set of rings (centered in the global pivots), so the region
volume becomes more compact - see Figure 3a.
Similarly, the PM-tree ground entries are enhanced by
distances to the pivots, which are interpreted as rings
as well. Each ring stored in a routing/ground entry
represents a distance range (bounding the underlying
data) with respect to a particular pivot.

A routing entry in a PM-tree inner node is defined
as:

$$\mathrm{rout}_{PM}(R) = [R, r_R, \delta(R, \mathrm{Par}(R)), \mathrm{ptr}(T(R)), HR],$$

where the new HR attribute is an array of $p_{hr}$ intervals
($p_{hr} \leq p$), where the t-th interval $HR_{P_t}$ is the
smallest interval covering distances between the pivot
$P_t$ and each of the objects stored in leaves of T(R), i.e.,
$HR_{P_t} = \langle HR_{P_t}^{min}, HR_{P_t}^{max} \rangle$, $HR_{P_t}^{min} = \min\{\delta(O_j, P_t)\}$,
$HR_{P_t}^{max} = \max\{\delta(O_j, P_t)\}$, $\forall O_j \in T(R)$. The interval
$HR_{P_t}$ together with pivot $P_t$ define a ring region
$(P_t, HR_{P_t})$; a ball region $(P_t, HR_{P_t}^{max})$ reduced by a
"hole" $(P_t, HR_{P_t}^{min})$.

A ground entry in a PM-tree leaf is defined as:

$$\mathrm{grnd}_{PM}(D) = [D, \mathrm{id}(D), \delta(D, \mathrm{Par}(D)), PD],$$

where the new PD attribute stands for an array of
$p_{pd}$ pivot distances ($p_{pd} \leq p$), where the t-th distance
$PD_{P_t} = \delta(D, P_t)$.

The combination of all the p entry's ranges produces
a p-dimensional minimum bounding rectangle
(MBR), hence, the global pivots actually map the
metric regions/data into a "pivot space" of dimensionality
p (see Figure 3b). The number of pivots can
be defined separately for routing and ground entries
- we typically choose fewer pivots for ground entries to
reduce storage costs (i.e., $p = p_{hr} > p_{pd}$).

Figure 3: (a) PM-tree employing 2 pivots ($P_1$, $P_2$).
(b) Projection of PM-tree into the "pivot space".

When issuing a range or kNN query, the query object
is mapped into the pivot space - this requires
p extra distance computations $\delta(Q, P_i)$, $\forall P_i \in \mathcal{P}$.
The mapped query ball $(Q, r_Q)$ forms a hyper-cube
$\langle \delta(Q, P_1) - r_Q, \delta(Q, P_1) + r_Q \rangle \times \cdots \times \langle \delta(Q, P_p) - r_Q, \delta(Q, P_p) + r_Q \rangle$
in the pivot space that is repeatedly
utilized to check for an overlap with routing/ground
entry's MBRs (see Figures 3a, b). If they do not overlap,
the entry is filtered out without any distance
computation, otherwise, the M-tree's filtering steps
(parent & basic filtering) are applied. Actually, the
MBRs overlap check can also be understood as $L_\infty$
filtering, that is, if the $L_\infty$ distance from a PM-tree
region to the query object Q is greater than $r_Q$, the
region is not overlapped by the query.
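
The pivot-space check can be illustrated by the following sketch. It is only an illustration under assumptions: the rings of an entry are represented as a list of (minimum, maximum) pairs, one per pivot, the query-to-pivot distances are assumed to be precomputed, and the function names are hypothetical.

```python
def rings_overlap_query(rings, query_pivot_dists, r_q):
    """True if the entry's MBR (one ring per pivot) intersects the query
    hyper-cube in every dimension of the pivot space."""
    return all(hr_min <= d + r_q and d - r_q <= hr_max
               for (hr_min, hr_max), d in zip(rings, query_pivot_dists))

def linf_distance_to_region(rings, query_pivot_dists):
    """L-infinity distance from the mapped query object to the entry's MBR."""
    return max(max(hr_min - d, d - hr_max, 0.0)
               for (hr_min, hr_max), d in zip(rings, query_pivot_dists))

rings = [(2, 4), (5, 7)]   # HR intervals with respect to pivots P_1, P_2
q_dists = [3, 9]           # delta(Q, P_1), delta(Q, P_2)
print(rings_overlap_query(rings, q_dists, 1))
print(rings_overlap_query(rings, q_dists, 2))
print(linf_distance_to_region(rings, q_dists))
```

Both formulations give the same decision: the entry survives the check exactly when the $L_\infty$ distance does not exceed the query radius.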

Note the MBRs overlap check does not require an
explicit distance computation, so the PM-tree usually
achieves significantly lower query costs when compared
with M-tree - up to an order of magnitude (see
Skopal 2004, 2007, Skopal et al. 2005).



1.3 Paper Contributions 

In this paper we introduce metric skyline processing
by use of PM-tree, which is a metric access method
suitable for similarity search in large databases. We
follow the pioneering work (Chen & Lian 2008) where
the concept of metric skyline query was introduced,
and its implementation utilizing M-tree was proposed.
In Section 2 the metric skyline query and its original
implementation are discussed, while in Section 3
we propose our original PM-tree implementation of
metric skyline processing. In the experimental results
we show that PM-tree based metric skyline
processing outperforms the original M-tree implementation
not only in terms of distance computation
costs, but also in terms of I/O costs, internal CPU
costs and internal space costs.



2 Metric Skyline Queries 

In relational databases, the multi-criterial retrieval is 
popular in situations where a query exactly specify- 
ing the desired attribute ranges cannot be effectively 
issued. Instead, there is a need for a simplified query 
concept which selects the desired database objects by 
some aggregation technique. 

Besides the top-k queries (Fagin 1999), a popular
multi-criterial retrieval technique is the skyline operator
(Borzsonyi et al. 2001).



2.1 The Skyline Operator 

The traditional skyline operator is an advanced re- 
trieval technique that selects objects from a multi- 
dimensional database that are "the best" from the 
global point of view. The only assumption on the
database is that the attribute domains (dimensions)
are linearly ordered, such that the lower (or higher)
the value of an attribute, the better the object is (in
that attribute). In the rest of the paper we adopt
the convention that a lower value in an attribute is
better.

The skyline operator selects all objects from the
database (the skyline set) that are not dominated by
any other object. An object $O_1$ dominates another
object $O_2$ if at least one of $O_1$'s attribute values is
lower than the same attribute in $O_2$, and the other
attribute values in $O_1$ are lower or equal to the corresponding
attribute values in $O_2$. Hence, $O_1$ is the
dominating object, while $O_2$ is the dominated object.
In Figure 4 see an example of a skyline set consisting
of 5 objects, dominating the remaining 6 objects.
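
The dominance relation and the resulting skyline set can be written down directly. The following sketch assumes objects are tuples of numeric attributes with the lower-is-better convention; the function names are illustrative, and the quadratic scan is meant only to make the definition concrete, not to be efficient.

```python
def dominates(a, b):
    """True if a dominates b: a is nowhere worse and somewhere strictly better."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def naive_skyline(objects):
    """All objects that are not dominated by any other object."""
    return [o for o in objects if not any(dominates(p, o) for p in objects)]

points = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
print(naive_skyline(points))
```

Note that an object never dominates itself or an exact duplicate, because at least one attribute has to be strictly lower.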

2.1.1 Skyline Processing 

There exist many approaches to the efficient implementation
of the skyline operator, while we outline
two of them - the Sort-First Skyline algorithm
(Borzsonyi et al. 2001) and the branch-and-bound
algorithm, which will be useful further in the
paper.

Figure 4: A skyline set and the dominated objects.

In the Sort-First Skyline algorithm, the database
objects $O_i$ are just ordered in ascending order based on
the $L_1$ norm on attributes (coordinates) of $O_i$, i.e.,
$\|O_i\|_{L_1} = O_i^1 + O_i^2 + \cdots + O_i^n$. Then, following the $L_1$
order, the sorted database is passed such that each
visited object $O_i$ is checked whether it is dominated
by the already determined skyline objects. If $O_i$ is
not dominated, it is added to the skyline set (empty
at the beginning), otherwise, $O_i$ is ignored. After the
one-pass database traversal is finished, the skyline set
is complete.

The above algorithm is correct because of the $L_1$-norm
ordering. Suppose an object $O_i$ is being processed
(see Figure 5a). Because every object possibly
dominating $O_i$ lies in the dominating area, its
$L_1$ norm must be lower than that of $O_i$. However,
such an object has already been visited (and possibly
added to the skyline set) because of the ordered
database traversal. Thus, $O_i$ can be either safely
added to the skyline set or filtered out.

Figure 5: A dominating-dominated (a) object and (b)
rectangle (MDDR).

The branch-and-bound approach employs a spatial
access method (SAM), e.g. the R-tree (Guttman
1984). The database is indexed by the SAM, while
for the skyline processing a memory-resident priority
heap is additionally utilized. The heap priority
is defined, again, as the $L_1$ norm, however, besides
the database objects themselves, the heap may contain
also minimum bounding rectangles (MBRs, natively
maintained by, e.g., R-tree). For future use
outside the scope of SAM, we call MBRs minimum
dominating-dominated rectangles (MDDRs). The
MDDRs serve as spatial rectangular approximations
of the underlying database objects (or nested MDDRs),
while they can be effectively used for filtering.
The order of an MDDR within the heap is defined
by the $L_1$ norm of its minimal corner (the point of
MDDR with minimal values in all dimensions), which
is the maximal lower bound to the $L_1$ norm of any object
inside the MDDR.

The skyline processing starts by inserting the top
MDDR (within the SAM hierarchy, e.g., R-tree root)
into the empty heap. Then, in every step an entry, either
an MDDR or a database object, is popped from
the heap, while it is checked whether it is dominated
by the already determined skyline objects (see Figure
5b). If an MDDR is popped that is not dominated, its
descendants in the SAM hierarchy are fetched and inserted
into the heap, otherwise the MDDR is removed
from further processing. If an object is popped from
the heap that is not dominated, it is added to the
skyline set (otherwise filtered out). The correctness
of this algorithm is guaranteed by the $L_1$ ordering of
the heap.
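
The Sort-First Skyline procedure described in this section can be sketched in a few lines. The sketch assumes non-negative numeric attributes (so that the sum of coordinates equals the $L_1$ norm) and the lower-is-better convention; the function names are illustrative and not taken from the cited work.

```python
def dominates(a, b):
    """True if a dominates b: a is nowhere worse and somewhere strictly better."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def sort_first_skyline(objects):
    """One pass over the objects in ascending L1-norm order.

    Any object that could dominate the current one has a strictly lower
    L1 norm, so it has been visited already; checking against the skyline
    found so far is therefore sufficient.
    """
    skyline = []
    for o in sorted(objects, key=sum):   # sum == L1 norm for non-negative values
        if not any(dominates(s, o) for s in skyline):
            skyline.append(o)
    return skyline

points = [(3, 3), (6, 6), (5, 1), (1, 5), (2, 2)]
print(sort_first_skyline(points))
```

Compared with the naive pairwise test, each object is compared only against the (usually small) skyline set determined so far.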



2.1.2 Advanced Skyline Queries 

Recently, the concept of skyline operator has been
generalized to fit dynamic conditions, where the
database object attributes and/or their values are
not static. For example, the dynamic skylines (Papadias
et al. 2005) consider the attributes as dimension
functions. The spatial skyline queries (Sharifzadeh
& Shahabi 2006) treat the attribute values as dynamically
computed Euclidean distances from a set of
query points (multi-point spatial query). The multi-source
skyline queries (Deng et al. 2007) are similar,
however, instead of Euclidean distances in continuous
space the multi-source skyline queries use the
shortest-path distances in a graph (in road network,
respectively). As another approach, the reverse skyline
queries (Dellis & Seeger 2007) return the objects
whose dynamic skyline contains the original query object
(of the reverse skyline query).

2.2 Metric Skyline Queries 

The spatial skyline queries were generalized recently
to support an arbitrary metric distance $\delta$ (i.e., not
just Euclidean), constituting thus the metric skyline
queries (MSQ) (Chen & Lian 2008, 2009).

Generally speaking, the metric skyline model just
adds an abstract transformation step before the usual
skyline processing. The step consists of a transformation
of a database in a metric space into a database
in an m-dimensional vector space through a set $\mathcal{Q}$ of
$m = |\mathcal{Q}|$ query examples. In the second step,
the traditional skyline operator is performed on the
transformed database. In particular, a database
object $O_i$ in the metric space is transformed into
a vector $V_i$, where its j-th coordinate is defined
as the distance from the j-th query example to $O_i$, i.e.,
$V_i = (\delta(Q_1, O_i), \delta(Q_2, O_i), \ldots, \delta(Q_m, O_i))$, $Q_j \in \mathcal{Q}$.
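
The two-step model can be illustrated by a direct, unoptimized sketch. The Manhattan distance stands in for an arbitrary metric, the objects are assumed to be hashable tuples, and all names are hypothetical; every query-to-object distance is computed explicitly here.

```python
def manhattan(a, b):
    """Example metric (an assumption); any metric distance would do."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dominates(a, b):
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def to_query_space(obj, queries, dist):
    """Step 1: map an object to its vector of distances to the query examples."""
    return tuple(dist(q, obj) for q in queries)

def naive_metric_skyline(database, queries, dist=manhattan):
    """Step 2: the traditional skyline operator on the transformed vectors."""
    vectors = {o: to_query_space(o, queries, dist) for o in database}
    return [o for o in database
            if not any(dominates(vectors[p], vectors[o]) for p in database)]

db = [(1, 0), (2, 0), (3, 0), (2, 3), (9, 9)]
queries = [(0, 0), (4, 0)]
print(naive_metric_skyline(db, queries))
```

In this toy example the three objects lying between the two query examples form the metric skyline, while the remaining two objects are dominated.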

2.2.1 Motivation 

The motivation for MSQ can be seen in the insufficient
expressive power of range and kNN queries,
as mentioned in Section 1.1.2. Besides the possibility
of employing multiple query examples, the metric
skyline query also has another unique property, the
absence of query extent, i.e., the query is defined just
by the set $\mathcal{Q}$. This property could be seen as both
an advantage and a disadvantage.

The advantage is that metric skyline query returns 
all distinct objects from the database that are as sim- 
ilar to the query examples as possible. Hence, we 
obtain all such objects; we are freed from tuning the 
precision and recall proportion. When issuing range
or kNN queries, we have to specify the query extent
(i.e., the query radius or the number of nearest neighbors),
which may not be as easy as it seems. In particular,
the definition of a range query radius requires
expert knowledge of the underlying metric distance,
otherwise we obtain too small or too large an answer
set. The kNN query is more user-friendly, however,
the precision/recall problem still remains.

Unfortunately, the disadvantage of MSQ is the
skyline set (answer set) size. If $m = |\mathcal{Q}| = 1$ we
obtain a regular 1-NN query. However, with increasing
m the skyline size usually grows substantially,
while a skyline set size exceeding several percent of 
the database is usually useless for an end-user. More- 
over, the skyline set is not uniquely ordered (unlike 
range or kNN answers) , so a reduction of the skyline 
set cannot be guided by some internal structure of 
the answer. Thus, to be discriminative enough, the 
metric skyline query should employ only a few query 
examples (say, 2-5). 

2.2.2 M-tree Based Implementation 

The above described straightforward two-step abstraction
is not suitable for implementation of MSQ.
An explicit transformation of the original database
$\mathbb{S}$ into the vector space would require expensive static
preprocessing of the database, consisting of $|\mathcal{Q}| \cdot |\mathbb{S}|$
distance computations, extra storage costs, etc. Remember,
the main cost component in similarity search
by MAMs is the number of distance computations, so
any MSQ algorithm should be designed to avoid computing
as many distances as possible.

The authors of metric skyline queries proposed a
native MSQ processing by M-tree (Chen & Lian 2008,
2009), where the transformation step is applied only
on a part of the database that cannot be skipped during
the processing. Basically, the M-tree based metric
skyline algorithm was inspired by the traditional
skyline processing by R-tree and the priority heap $\mathcal{H}$
under $L_1$ norm (as described in Section 2.1.1).

In the following we have reformulated the original
description in (Chen & Lian 2008, 2009) to the more
abstract MDDR formalism, due to its easier extensibility
to our original contribution in Section 3.

The modification of the traditional R-tree based
skyline processing to the metric case resides in an
"on-the-fly" derivation of MDDRs, which cover the
transformed data objects. Instead of "native" R-tree
MDDRs (MBRs, resp.), we distinguish two types of
derived MDDRs in M-tree, as follows:

• The Par-MDDR (parent MDDR) of a routing/ground
entry $\mathrm{entry}(R, r_R, \ldots)$ is constructed
by use of the parent routing entry $\mathrm{rout}(P, \ldots)$
as
$MDDR_{Par} = \langle LB^{Par}_{Q_1}, UB^{Par}_{Q_1} \rangle \times \cdots \times \langle LB^{Par}_{Q_m}, UB^{Par}_{Q_m} \rangle$,
where $LB^{Par}_{Q_i}$ is a lower-bound
distance from $Q_i$ to the region $(R, r_R)$
(through its parent P), while $UB^{Par}_{Q_i}$ is an
upper-bound distance from $Q_i$ to $(R, r_R)$.
Thus, $LB^{Par}_{Q_i} = \max(\delta(Q_i, P) - (\delta(P, R) + r_R), (\delta(P, R) - r_R) - \delta(Q_i, P), 0)$, and
$UB^{Par}_{Q_i} = \delta(Q_i, P) + \delta(P, R) + r_R$
(for a ground entry, $r_R = 0$).

• The B-MDDR (basic MDDR), constructed
directly from a routing/ground entry as
$MDDR_B = \langle \delta(Q_1, R) - r_R, \delta(Q_1, R) + r_R \rangle \times \cdots \times \langle \delta(Q_m, R) - r_R, \delta(Q_m, R) + r_R \rangle$. As a
consequence, the B-MDDR of a ground entry is a
single point.

Obviously, we have chosen the terms "Par-MDDR"
and "B-MDDR" due to the analogy with
parent and basic filtering used when processing a
range or kNN query in M-tree. The Par-MDDR of
a routing/ground entry can be derived without an
explicit distance computation; the $\delta(Q_i, P)$ distances
were already computed during the top-down M-tree
traversal. The derivation of B-MDDR is more expensive,
it requires m computations of $\delta(R, Q_i)$, $\forall Q_i \in \mathcal{Q}$.
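
The derivation of both MDDR types can be sketched as follows. The sketch assumes the distances are already available as plain numbers (the query-to-parent distances from the traversal, the to-parent distance from the entry); the function names and the list-of-intervals representation are hypothetical.

```python
def par_mddr(d_qp_list, d_pr, r_r):
    """Par-MDDR of an entry (R, r_r) derived through its parent P.

    d_qp_list[i] is delta(Q_i, P), already known from the parent level;
    d_pr is the stored to-parent distance delta(P, R).
    No new distance computation is needed.
    """
    return [(max(d_qp - (d_pr + r_r), (d_pr - r_r) - d_qp, 0.0),
             d_qp + d_pr + r_r)
            for d_qp in d_qp_list]

def b_mddr(d_qr_list, r_r):
    """B-MDDR of an entry (R, r_r); d_qr_list[i] is delta(Q_i, R),
    which costs one distance computation per query example."""
    return [(d_qr - r_r, d_qr + r_r) for d_qr in d_qr_list]

# Toy 1-D example (absolute difference as metric): P = 0, R = 5, r_R = 1,
# query examples Q_1 = 10 and Q_2 = 2.
print(par_mddr([10, 2], 5, 1))
print(b_mddr([5, 3], 1))
```

In this example the B-MDDR is nested inside the Par-MDDR in both dimensions, which reflects the trade-off described above: the Par-MDDR is free but looser, the B-MDDR is tighter but costs m distance computations.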



An MDDR $M_1$ dominates all objects inside an
MDDR $M_2$ if the $L_1$ norm of $M_1$'s maximal corner
is lower than the $L_1$ norm of $M_2$'s minimal corner,
where a maximal/minimal corner is the point
with maximal/minimal values in all dimensions of an
MDDR. For an example of Par-MDDR and B-MDDR,
see Figure 6.

Figure 6: (a) Metric space with M-tree regions (b)
Transformed vector space with MDDRs

The MSQ algorithm starts by inserting routing entries
from the M-tree root into the priority heap $\mathcal{H}$.
The heap keeps the order given by the $L_1$ norm applied on
the entries' B-MDDRs' minimal corners. Then a loop
follows until the heap gets empty:

1. An entry entry(R, ...) with the lowest $L_1$ value
of its B-MDDR is popped from the heap.

2. If the entry is a ground entry, it is added to the 
set of skyline objects. All entries on the heap 
which are dominated by this new skyline object 
are removed. Jump to Step 1. 

3. If the entry is a routing entry, the entry's child 
node is fetched. The Par-MDDRs of the child 
node's entries are checked for dominance by the 
set of already determined skyline objects, while 
the dominated ones (and the respective subtrees, 
in case of routing entries) are filtered from further 
processing. 

4. The B-MDDRs of the non-filtered child entries 
are derived. Those entries not dominated by the 
already retrieved skyline set are inserted into the 
heap. Jump to Step 1. 
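
The loop above can be sketched on a toy in-memory hierarchy. This is a simplified illustration under several assumptions, not the original implementation: the `Entry` class is a hypothetical stand-in for M-tree entries, the Par-MDDR pre-filter of Step 3 is omitted, and dominated heap entries are skipped lazily when popped instead of being removed eagerly as in Step 2.

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass
class Entry:
    center: tuple                                  # routing object R or data object D
    radius: float = 0.0                            # covering radius (0 for ground entries)
    children: list = field(default_factory=list)   # empty for ground entries

def dominates(a, b):
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def min_corner(entry, queries, dist):
    """Minimal corner of the entry's B-MDDR, clamped at zero."""
    return tuple(max(dist(q, entry.center) - entry.radius, 0.0) for q in queries)

def metric_skyline(root_entries, queries, dist=math.dist):
    skyline, heap, seq = [], [], 0                 # skyline holds (object, vector) pairs
    pending = list(root_entries)
    while pending or heap:
        # Insert pending entries whose B-MDDR is not dominated by the skyline.
        for e in pending:
            corner = min_corner(e, queries, dist)
            if not any(dominates(v, corner) for _, v in skyline):
                heapq.heappush(heap, (sum(corner), seq, corner, e))
                seq += 1
        pending = []
        if not heap:
            break
        _, _, corner, e = heapq.heappop(heap)
        if any(dominates(v, corner) for _, v in skyline):
            continue                               # lazy removal of dominated entries
        if e.children:
            pending = e.children                   # routing entry: expand child node
        else:
            skyline.append((e.center, corner))     # ground entry: new skyline object
    return [obj for obj, _ in skyline]

leaves1 = [Entry((1, 0)), Entry((2, 0)), Entry((3, 0))]
leaves2 = [Entry((2, 3)), Entry((9, 9))]
root = [Entry((2, 0), 1.0, leaves1), Entry((2, 3), 10.0, leaves2)]
print(metric_skyline(root, [(0, 0), (4, 0)]))
```

Even in this tiny example every entry of an expanded node is pushed onto the heap after its B-MDDR has been computed, which hints at why the heap size and the number of distance computations deserve attention.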

2.2.3 Discussion & Criticism






Unfortunately, in the original contribution (Chen &
Lian 2008, 2009) the cost analysis and also the experiments
were focused solely on measuring the number
of dominance checks, i.e., how many times B-MDDRs
and Par-MDDRs were checked for dominance by a
skyline object. The authors completely ignored not only the
number of distance computations (the crucial performance
factor for any MAM), but also the heap size
and the number of operations on the heap, spent by running
the metric skyline algorithm on M-tree.

As we present later in the experimental evaluation, the
above algorithm, as proposed in (Chen & Lian 2008,
2009), is extremely inefficient in terms of the heap
size and the number of operations on the heap. In
fact, the maximal heap size could reach the size of 
the database (!), making such an implementation in- 
applicable in database environments. In the following 
section we introduce a PM-tree based method, which 
not only decreases the number of distance compu- 
tations spent for metric skyline processing, but also 
drastically decreases the maximal heap size and the 
number of operations on the heap. 



3 PM-tree based metric skyline 

The M-tree based approach to metric skyline processing
can be extended to a PM-tree based implementation.
In the following we introduce an algorithm
that makes use of the PM-tree's extensions over the
M-tree - the pivot set $\mathcal{P}$ and the respective ring regions
maintained by routing/ground entries in PM-tree
nodes (for PM-tree details see Section 1.2.2).

First of all, when a metric skyline query is started,
a query-to-pivot matrix of pair-wise distances between
the PM-tree pivots $P_j \in \mathcal{P}$ and query examples
$Q_i \in \mathcal{Q}$ is computed. The PM-tree based implementation
then utilizes the following three filtering concepts
(Sections 3.1-3.3), which are subsequently summarized
within an algorithm.

3.1 Deriving Piv-MDDRs

Besides the M-tree's B-MDDRs and Par-MDDRs derived
from a routing/ground $\mathrm{entry}(R, \ldots, HR/PD)$,
an additional MDDR can be derived from the set
of rings HR/PD maintained by the entry, called
Piv-MDDR (pivot MDDR). The Piv-MDDR can
be derived using the query-to-pivot matrix, as
$MDDR_{Piv} = \langle LB^{Piv}_{Q_1}, UB^{Piv}_{Q_1} \rangle \times \cdots \times \langle LB^{Piv}_{Q_m}, UB^{Piv}_{Q_m} \rangle$,
where
$LB^{Piv}_{Q_i} = \max_{P_j \in \mathcal{P}} \{\delta(P_j, Q_i) - HR^{max}_{P_j}, HR^{min}_{P_j} - \delta(P_j, Q_i), 0\}$, and
$UB^{Piv}_{Q_i} = \min_{P_j \in \mathcal{P}} \{\delta(P_j, Q_i) + HR^{max}_{P_j}\}$.

Similarly to the M-tree's Par-MDDR, the derivation
of Piv-MDDR requires no extra distance computation,
however, Piv-MDDRs are much more compact
than Par-MDDRs. This results in more effective filtering
of routing/ground entries by skyline objects or
some dominating MDDRs. Moreover, the Piv-MDDR
is often even more compact than the direct B-MDDR,
because the PM-tree's rings reduce the volume of the
original M-tree's sphere. In Figure 7 see an example
of Piv-MDDR, Par-MDDR and B-MDDR, when
2-pivot PM-tree and 2 query examples are used.

Figure 7: A PM-tree routing entry in (a) metric space
and (b) mapped to Piv-, Par-, and B-MDDR.

Naturally, when having two or three x-MDDRs
available, e.g., B-MDDR + Piv-MDDR, we can intersect
them to form a single compact MDDR which
is then used for filtering.

3.2 Pivot-Skyline Filtering

If the pivots $P_i$ come from the database (i.e., $P_i \in
\mathcal{P} \subset \mathbb{S}$), the MDDRs that are about to be inserted into
the heap can be checked for a dominance by the pivots.
Since the query-to-pivot matrix is computed at
the beginning of every metric skyline query processing,
the transformation of the pivots into the "query
space" requires no additional distance computations.
Moreover, to reduce the number of pivots used for
dominance checking, we can determine the so-called
pivot skyline - those pivot objects, which constitute
a metric skyline within the pivot set $\mathcal{P}$ itself, see an
example in Figure 8.

Figure 8: A pivot skyline.

The filtering by use of pivot skyline is beneficial in 
the early phase of the metric skyline processing, when 
the set of determined skyline objects is still empty. In 
the experiments we show that such an early phase is 
the dominant phase of the entire skyline processing 
- 80-90% of the total distance computations are performed
before the first skyline object is found. Hence,
pruning the heap by use of the pivot skyline greatly 
helps to reduce the heap size and, consequently, the 
number of operations on the heap. 
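
A minimal sketch of the pivot-skyline filtering in Python, under stated assumptions: each pivot is represented by its row of the query-to-pivot matrix (its position in the "query space"), and an MDDR is treated as dominated when its lower-bound corner is dominated. The names dominates, pivot_skyline and mddr_dominated are hypothetical helpers introduced here for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]  # distances to the query examples, one per dimension


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: not worse in any dimension, strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pivot_skyline(p2q: Sequence[Point]) -> List[Point]:
    """Keep only the pivots not dominated by another pivot.

    p2q[j] is pivot p_j mapped to the query space, i.e. its distances to all
    query examples, taken from the query-to-pivot matrix.
    """
    return [p for p in p2q if not any(dominates(o, p) for o in p2q)]


def mddr_dominated(lower_corner: Point, skyline: Sequence[Point]) -> bool:
    """An MDDR can be discarded if some skyline point dominates its lower
    corner, because every point inside the MDDR is then dominated as well."""
    return any(dominates(s, lower_corner) for s in skyline)
```

Since the pivots are database objects, discarding an MDDR dominated by a pivot is safe: the pivot itself (or an object dominating it) will appear in the final skyline.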

3.2.1 Merging Pivot Skyline with the Regular Skyline

As the number of determined skyline objects grows, 
the objects in the pivot skyline become dominated by 
the "regular" skyline objects. Hence, in order to ef- 
fectively use the pivots for dominance checking, we 
keep just those pivots in the pivot skyline, that are 
not dominated by the already determined skyline ob- 
jects. Thus, at the moment when all skyline objects 
are known, the pivot skyline becomes empty. 
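
The merging step can be sketched as follows, assuming the pivot skyline is kept as a list of points in the query space. The helper name prune_pivot_skyline is an assumption of this sketch; it would be called each time a new regular skyline object is determined.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: not worse in any dimension, strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def prune_pivot_skyline(pivot_sky: Sequence[Point],
                        new_skyline_obj: Point) -> List[Point]:
    """Drop the pivots dominated by a newly determined regular skyline object.

    Such pivots are redundant for dominance checking: anything they dominate
    is also dominated by the regular skyline object.
    """
    return [p for p in pivot_sky if not dominates(new_skyline_obj, p)]
```

Repeated pruning shrinks the pivot skyline as the regular skyline grows, so later dominance checks iterate over fewer pivots.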

3.3 Deferred Heap Processing 

In the original M-tree algorithm, the priority heap
contains just $L_1$-ordered B-MDDRs (together with
the associated routing/ground entries). When an entry
is to be inserted into the heap, its B-MDDR must
be determined (see Steps 3 and 4 of the algorithm in
Section 2.2.2). We call this approach a non-deferred heap
processing.

However, the non-deferred heap processing is not 
optimal in terms of the number of distance computa- 
tions. In order to save some distance computations, 
we propose the deferred heap processing for the metric 
skyline, inspired by Hjaltason & Samet's incremental
nearest neighbor algorithm, which is optimal
in the number of distance computations (Hjaltason &
Samet 2000).



The modified heap is generalized such that it may
contain not only B-MDDRs of routing/ground entries,
but also the intersections of their Piv-MDDR
and Par-MDDR (denoted as Piv-MDDR ∩ Par-MDDR).
The deferred heap processing then deals
with two situations:

• An entry equipped with B-MDDR is popped from
the heap. Then,

(a) If the entry is a ground entry, it becomes a
skyline object.

(b) If the entry is a routing entry, its child node
is fetched, and for every entry in the child
node the Piv-MDDR ∩ Par-MDDR is checked
for dominance by the skyline set. Every
non-dominated child entry is equipped with its
Piv-MDDR ∩ Par-MDDR and inserted into the heap.

• An entry equipped with Piv-MDDR ∩ Par-MDDR
is popped from the heap. The entry is checked for
dominance by the skyline set. If not dominated,
the entry's B-MDDR is determined and, if still
not dominated, the entry is inserted back into the heap.

Note: The deferred heap processing "gives the
algorithm a chance" to filter out as many entries as
possible, without the need of B-MDDR derivation
(requiring explicit distance computations). On the other
hand, the "reinsertions" of PM-tree entries into the
heap (first, equipped with Piv-MDDR ∩ Par-MDDR,
and second, equipped with B-MDDR) increase the number
of operations on the heap and also the heap size.
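
The idea of deferring can be illustrated by a simplified Python sketch that works on a flat set of objects rather than on a PM-tree. The assumptions are: cheap_lb[i] stands in for the lower corner of the Piv-MDDR ∩ Par-MDDR of object i (available without distance computations), exact(i) stands in for the B-MDDR derivation (which costs distance computations), and the heap is ordered by the $L_1$ norm. All names are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, ...]


def dominates(a: Vec, b: Vec) -> bool:
    """a dominates b: not worse in any dimension, strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def deferred_skyline(cheap_lb: Sequence[Vec],
                     exact: Callable[[int], Vec]) -> Tuple[List[Vec], int]:
    """Return (skyline, number of exact() calls).

    Requires cheap_lb[i] <= exact(i) in every dimension (a valid lower bound).
    """
    # Heap items: (L1 key, object id, has exact position?, vector).
    heap = [(sum(lb), i, False, lb) for i, lb in enumerate(cheap_lb)]
    heapq.heapify(heap)
    skyline: List[Vec] = []
    calls = 0
    while heap:
        _, i, is_exact, vec = heapq.heappop(heap)
        if any(dominates(s, vec) for s in skyline):
            continue  # filtered; for a cheap bound, no distance was computed
        if is_exact:
            skyline.append(vec)  # nothing left in the heap can dominate it
        else:
            calls += 1  # deferred "B-MDDR" derivation happens only now
            v = exact(i)
            if not any(dominates(s, v) for s in skyline):
                heapq.heappush(heap, (sum(v), i, True, v))  # reinsertion
    return skyline, calls
```

In this sketch, every object whose cheap bound is already dominated when it is popped never triggers exact(), at the price of one extra heap insertion for each surviving object.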



3.4 The Algorithm 

In Listing 1, the algorithm for metric skyline query
is presented, including the original M-tree variant as
well as the proposed PM-tree extensions.

The input attribute type allows to set the MSQ
variant as follows: type = 'M-tree' is the original
M-tree based algorithm, type = 'PM-tree' is the basic
PM-tree based algorithm using the Piv-MDDR
filtering (as described in Section 3.1), type =
'PM-tree+PSF' additionally uses the pivot-skyline
filtering (as described in Section 3.2), and type =
'PM-tree+PSF+DEF' additionally uses the deferred heap
processing (as described in Section 3.3).



Listing 1 (Algorithm of metric skyline query) 



MSQuery()
{
  Input: PM-tree PM, query points Q, type ('M-tree', 'PM-tree',
         'PM-tree+PSF', 'PM-tree+PSF+DEF')
  Output: Result containing skyline points

  if (type is not 'M-tree')
    P2Q_DM = evaluate the query-to-pivot matrix
    // pivots must be DB objects
    PSL = evaluate pivot skyline (using P2Q_DM)

  Insert all routing entries + their Piv-MDDR ∩ B-MDDR from the
    PM-tree root into the heap H

  while (H is not empty)
    currentEntry = pop entry from the heap H

    if (currentEntry is not equipped by 'B-MDDR')
      FilterAndInsert(currentEntry, currentEntry, type, true)
    else if (currentEntry is of type 'ground entry' and is equipped by
             'B-MDDR')
      Insert currentEntry into MSS
      H.FilterDominatedObjectsBy(currentEntry.MDDR)
      PSL.FilterDominatedObjectsBy(currentEntry.MDDR)
    else
      N = fetch child node of currentEntry
      for each childEntry in N
        FilterAndInsert(childEntry, currentEntry, type, false)
}

FilterAndInsert(newEntry, parentEntry, type, deferred)
{
  if (not deferred)
    Equip newEntry by its Par-MDDR
    if (type is not 'M-tree')
      Update newEntry.MDDR by intersection with newEntry's Piv-MDDR

  if (Filter(newEntry, type))
    return

  if (type = 'PM-tree+PSF+DEF' and not deferred)
    Insert newEntry into H
    return

  Equip newEntry by its B-MDDR

  if (Filter(newEntry, type))
    return

  Insert newEntry into H
}

Filter(newEntry, type)
{
  for each O_i in MSS
    if (newEntry.MDDR is dominated by O_i)
      return true

  if (type is 'M-tree' or 'PM-tree')
    return false

  for each O_i in PSL
    if (newEntry.MDDR is dominated by O_i)
      return true

  return false
}



3.5 Runtime Properties 

Since the thread of metric skyline processing may gen- 
erally follow many different scenarios (depending on 
the data distribution, metric distance employed, num- 
ber of pivots, number of queries, etc.), the time and 
space costs cannot be exactly determined beforehand. 
Nevertheless, we could observe some properties that 
will (more or less) occur for any set of conditions. 

First, the algorithm of metric skyline query uses 
the priority heap (either the non-deferred or deferred 
variant) and second, an object already inserted into 
the skyline set remains a skyline object forever. The 
second observation gives a clue to the typical heap 
evolution. Because an object is inserted into the sky- 
line set after it is definitely clear it belongs to the 
skyline, one can conclude that a large proportion of 
the entire query logic must be performed before the 
first skyline object is reached. We call this early phase 
an expansion phase, because the heap content cannot 
be pruned by the dominating skyline objects (they 
do not exist yet), and so the heap size only grows 
(expands). After the skyline set begins to populate, 
the heap begins to shrink, because the insertions of 
child routing/ground entries into the heap are com- 
pensated by removals of the dominated MDDRs. We 
call the second phase a reduction phase. 

The expansion phase can be shortened by utilizing
the pivot-skyline filtering; however, the impact
of the regular skyline objects is much greater - they
dominate many more objects/MDDRs due to their
lower $L_1$ distances.

As we show in the experimental evaluation, the 
expansion phase takes 80-90% of the time, when mea- 
suring the time as the proportion of distances com- 
puted so far to the total number of distance compu- 
tations. When measuring the time in terms of heap 
operations, the expansion phase takes 25-75% of the 
time, while the heap size is maximal right before the 
reduction phase begins. 

3.5.1 Partial Metric Skyline 

As the set of already determined skyline objects can 
only grow, it is easy to adapt the metric skyline
algorithm to provide partial metric skyline queries, where
the user specifies only a limited number of skyline ob- 
jects she/he wants to retrieve. The algorithm simply 
terminates as soon as the specified number of skyline 
objects appears in the skyline set. 
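
The early termination can be sketched in Python as follows. This is a simplified illustration that assumes the objects are already mapped to the query space (so it does not model the index or its costs) and processes them in $L_1$ order, in which a point can only be dominated by points processed earlier. The function name partial_skyline and the parameter k are assumptions of this sketch.

```python
from typing import List, Optional, Sequence, Tuple

Vec = Tuple[float, ...]


def dominates(a: Vec, b: Vec) -> bool:
    """a dominates b: not worse in any dimension, strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def partial_skyline(points: Sequence[Vec], k: Optional[int] = None) -> List[Vec]:
    """Return the first k skyline points in L1 order (all of them if k is None)."""
    skyline: List[Vec] = []
    for p in sorted(points, key=sum):  # L1 order
        if k is not None and len(skyline) >= k:
            break  # the skyline set only grows, so we can stop here
        if not any(dominates(s, p) for s in skyline):
            skyline.append(p)
    return skyline
```

Because a point accepted into the skyline is never removed later, the prefix returned for a given k is always a subset of the full skyline.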

Unfortunately, due to the runtime properties de- 
scribed above, the performance of partial metric sky- 
line query does not scale well with the number of de- 
sired skyline objects. Even a single retrieved skyline 
object requires 25-75% of heap operations and 80- 
90% of distance computations, when compared to the 
"full" metric skyline query. 



4 Experimental Evaluation 



We performed extensive experiments with the
three new variants of the PM-tree based metric skyline
processing, comparing them against the original
M-tree based method. Instead of the number of dominance
checks, as included in the original contribution
(Chen & Lian 2008, 2009), we have observed other
measures of the costs spent by the MSQ processing -
the number of distance computations, the number of
operations on the heap, the maximal allocated size of
the heap, and finally the I/O costs.

In addition to the absolute numbers presented in 
the figures below, we also relate the number of distance
computations spent by (P)M-tree MSQ processing
to the costs of MSQ processed by simple sequential
search, which takes $|Q| \cdot |\mathbb{S}|$ distance computations for
every query.
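
For reference, the sequential baseline can be sketched in Python as below. The function name sequential_msq and the explicit counter are assumptions of this sketch; it simply maps every database object to the query space and then filters the dominated ones, so the number of distance computations equals the number of query examples times the database size.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: not worse in any dimension, strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def sequential_msq(db: Sequence[T], queries: Sequence[T],
                   dist: Callable[[T, T], float]) -> Tuple[List[T], int]:
    """Return (metric skyline, number of distance computations)."""
    computations = 0
    vectors = []
    for obj in db:
        vec = []
        for q in queries:
            vec.append(dist(obj, q))  # one distance per (object, query) pair
            computations += 1
        vectors.append(tuple(vec))
    skyline = [obj for obj, v in zip(db, vectors)
               if not any(dominates(w, v) for w in vectors)]
    return skyline, computations
```

With 5 database objects and 2 query examples, the counter ends at 10, regardless of how many objects belong to the skyline.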



4.1 The Testbed 

We have used two databases, a subset of the
CoPhIR database (Falchi et al. 2008) of MPEG7
image features extracted from images downloaded
from flickr.com, and a synthetic database of polygons.
The CoPhIR database, consisting of one million
feature vectors, was projected into two sub-databases,
the CoPhIR_12 database, consisting of
12-dimensional color layout descriptors, and the
CoPhIR_76 database, consisting of 76-dimensional
descriptors (12-dimensional color layout and
64-dimensional color structure). As a distance function
the Euclidean ($L_2$) distance was employed.

The Polygons database was a synthetic randomly
generated set of 250,000 2D polygons, each polygon
consisting of 5-15 vertices. The Polygons should serve
as a non-vectorial analogy to clustered points. The
first vertex of a polygon was generated at random.
The next one was generated randomly, but the distance
from the preceding vertex was limited to 10%
of the maximal distance in the space. We used the
Hausdorff distance (Huttenlocher et al. 1993) for measuring
the distance between two polygons, so here a polygon
could be interpreted as a cloud of points.
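
A minimal Python sketch of the Hausdorff distance between two polygons, each treated as a cloud of its vertices as described above. It assumes the Euclidean distance between individual vertices and uses a straightforward quadratic evaluation; the function names are chosen for this illustration only.

```python
import math
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point2D], b: Sequence[Point2D]) -> float:
    """Largest distance from a vertex of a to its nearest vertex of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point2D], b: Sequence[Point2D]) -> float:
    """Symmetric Hausdorff distance between two vertex sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

Each evaluation touches every pair of vertices, which is one reason why a single distance computation on such data is comparatively expensive.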

4.2 Experiment Settings 

The query costs were always averaged over 200 metric
skyline queries, while the query examples followed
the distribution of database objects. As the parameters
we observed various database sizes, the (P)M-tree
node capacities, the number of query examples,
the size of partial metric skyline, and the number of
PM-tree leaf pivots. The (P)M-tree node capacities
ranged from 20 to 40 routing/ground entries, the index
sizes took 200MB-2GB, and the (P)M-tree heights
were 3-5 (4-6 levels). The minimal (P)M-tree node
utilization was set to 20% of node capacity. The number
of PM-tree leaf pivots ranged from 30 to 1000,
while the number of inner pivots ranged from 15 to
500. Unless otherwise stated, the number of MSQ
query examples was 2, the (P)M-tree node size was
20, and the number of leaf pivots was 1000 for CoPhIR
and 300 for Polygons (the number of inner pivots was
half the number of leaf pivots).

4.3 The Results 

In the first set of experiments, the number of PM-tree
leaf pivots was increasing. In Figure 9a, the M-tree's
MSQ got to 17% of distance computations needed by
simple sequential search on the Polygons database.
However, for the highest number of pivots the
PM-tree's MSQ reduced the M-tree costs by another 35%.



The heap size required by PM-tree reached only up
to one third of the heap size required by the M-tree
(see Figure 9b). The impact of pivot-skyline filtering
(the +PSF(+DEF) variants) on the maximal heap
size was significant.



In the second set of experiments, the impact of
(P)M-tree's node size on the distance computations is
presented for the Polygons database (see Figure 12).
The performance of M-tree is more or less independent
of the node size, while the PM-tree performance
slightly decreases with increasing node size.

Figure 9: Increasing number of pivots on Polygons:
(a) Distance computations (b) Maximal heap size



Figure 12: Increasing size of (P)M-tree nodes.

The third set of experiments focused on the increasing
database size. In Figure 13, the results for the
CoPhIR_76 database are presented. The trend of increasing
distance computations is obvious for all MSQ
processing methods. However, the situation is dramatically
different for the number of heap operations
and the maximal heap size, where the PM-tree+PSF
beats the M-tree by a factor of 17 in heap operations,
and by a factor of 7 in the maximal heap size. On the
other hand, PM-tree+PSF+DEF suffers from a high
number of heap operations.



Figure 10: Increasing number of pivots on CoPhIR_12:
(a) Distance computations (b) Maximal heap size

In Figure 10, the same situation is presented for
the CoPhIR_12 database. The results are even better
than for Polygons - the number of distance computations
for the PM-tree+PSF+DEF variant was reduced to
60% of the M-tree costs, while the maximal heap size was
reduced down to 8% of the heap size required by the
M-tree (note the log. scale in Figure 10b).

Finally, in Figure 11 the same situation is presented
for the high-dimensional CoPhIR_76 database.
Because of the high dimensionality, the M-tree
performance was poor - it got to 91% of the distance
computations required by simple sequential search (see
Figure 11a). The PM-tree performed better, achieving
75% of the sequential search's distance computations.
In Figure 11b the number of heap operations is presented.
Here the PM-tree+PSF+DEF variant performs poorly
because of the deferred heap processing,
i.e., repeated insertions of MDDRs into the heap
(see Section 3.3). On the other hand, the +DEF variant
steadily achieves the lowest distance computation
costs (as expected). The PM-tree+PSF variant performs
the best, achieving 25% of the heap operations
spent by M-tree.

Figure 11: Increasing number of pivots on CoPhIR_76:
(a) Distance computations (b) Heap operations

Figure 13: Increasing size of CoPhIR_76 database:
(a) Distance computations (b) Maximal heap size (c)
Heap operations

In the fourth set of experiments, the processing of
partial metric skyline queries (as discussed in Section
3.5.1) is presented, where the results for an increasing
number of desired skyline objects are presented (see
Figure 14). As mentioned in Section 3.5.1, retrieving
the first skyline object costs almost as many distance
computations as retrieving the entire metric skyline.
The situation is slightly
better for the number of heap operations, where the
M-tree and PM-tree variants are relatively cheaper
when retrieving the first skyline object. On the other
hand, the costs of PM-tree+PSF are constant and
very low (17% of the M-tree costs).

In the fifth set of experiments, the results for an
increasing number of query examples used in metric
skyline queries are presented on the CoPhIR_12
database, see Figure 15. Because the number of skyline
objects grows substantially with the increasing
number of query examples (retrieving 50, 400, 1750,
4570 skyline objects for 2-, 3-, 4-, and 5-example
MSQs), the overall MSQ costs grow substantially as
well. Nevertheless, the PM-tree MSQ processing is
still much cheaper than the M-tree in the heap size
and operations, even for 5 query examples. However,
note that for 5 query examples the distance computations
of all the methods come close to the costs of
simple sequential search.

Figure 14: Increasing number of retrieved skyline objects:
(a) Distance computations (b) Heap operations

Figure 16: Increasing number of pivots: (a) I/O costs
(b) I/O costs vs. distance computations




Figure 15: Increasing number of query examples: 
(a) Distance computations (b) Maximal heap size (c) 
Heap operations 

Although the I/O costs do not represent a dominant
performance component in similarity search³,
in the last experiment we present the I/O costs as
a supplementary result (CoPhIR_12, 2 query examples).
In particular, in Figure 16a we give the numbers
of logical seeks⁴ spent by skyline processing (the
seek operation is the most expensive one when fetching
a page/PM-tree node from the disk). The PM-tree
based MSQ processing spent just 64% of the seek
operations required by the M-tree. As with the distance
computation costs, the I/O costs also decreased
with an increasing number of pivots.

In Figure 16b, the I/O costs vs. computation costs
are shown. As in the first chart, the pairs (I/O costs,
distance computations) were obtained for different
numbers of pivots employed by PM-tree. Since the
(P)M-tree indices consisted of 79,584 nodes, note that
the I/O costs correspond to fetching 55% of all the
index nodes for M-tree and 35% for PM-tree (1000
pivots). Also note there is a linear correlation between
the distance computations and I/O costs.

3 A single distance computation is generally supposed to be
much more expensive than a single I/O operation.

4 We did not consider any node caching in this experiment.

4.4 Summary 

The experimentation with M-tree and PM-tree based
metric skyline processing has shown that the PM-tree
outperforms the M-tree implementation by a factor of
up to 2 in the number of distance computations, almost
20 in the number of heap operations and the maximal
heap size, and almost 2 in the I/O costs.
The results for maximal heap size are exceptionally
important, because a large heap (which is a
main-memory structure) would prevent the processing
of metric skyline queries on very large databases.

5 Conclusions 

In this paper we have proposed a PM-tree based im- 
plementation of metric skyline query, a recently pro- 
posed multi-example query concept suitable for ad- 
vanced similarity search in multimedia databases. We 
have shown that the PM-tree based implementation 
of metric skylines significantly outperforms the ex- 
isting M-tree based implementation in all observed 
costs - the time, space, and I/O costs. We have 
also discussed and experimentally evaluated the per- 
formance of partial metric skyline processing, where 
only a limited user-defined number of skyline objects 
is retrieved.